Collaborative Filtering in a Non-Uniform World: 
Learning with the Weighted Trace Norm 
Abstract

We show that matrix completion with trace-
norm regularization can be significantly hurt
when entries of the matrix are sampled non- 
uniformly. We introduce a weighted version 
of the trace-norm regularizer that works well 
also with non-uniform sampling. Our experi- 
mental results demonstrate that the weighted 
trace-norm regularization indeed yields sig- 
nificant gains on the (highly non-uniformly 
sampled) Netflix dataset. 



1. Introduction 

Trace-norm regularization is a popular approach for matrix
completion and collaborative filtering, motivated both as a convex
surrogate to the rank (Fazel et al., 2001; Candes & Tao, 2009) and
in terms of a regularized infinite factor model with connections to
large-margin norm-regularized learning (Srebro et al., 2005b;
Bach, 2008; Abernethy et al., 2009; Salakhutdinov & Mnih, 2008).

Current theoretical guarantees on using the trace norm for matrix
completion all assume a uniform sampling distribution over entries
of the matrix (Srebro & Shraibman, 2005; Candes & Tao, 2009;
Candes & Recht, 2009; Recht, 2009). In a collaborative filtering
setting, where rows of the matrix represent e.g. users and columns
represent e.g. movies, this corresponds to assuming all users are
equally likely to rate movies and all movies are equally likely to
be rated. This of course could not be further from the truth, as in
any actual collaborative filtering application, some users are much
more active than others and some movies are rated by many people
while others are much less likely to be rated.

In Section 3 we show, both analytically and through simulations,
that this is not a deficiency of the proof techniques used to
establish the above guarantees. Indeed, a non-uniform sampling
distribution can lead to a significant deterioration in prediction
quality and an increase in the sample complexity. Under non-uniform
sampling, as many as $\Omega(n^{4/3})$ samples might be needed for
learning even a simple (e.g. orthogonal low-rank) $n \times n$
matrix. This is in sharp contrast to the uniform sampling case, in
which $O(n)$ samples are enough. It is important to note that if the
rank could be minimized directly, which is in general not
computationally tractable, $O(n)$ samples would be enough to learn a
low-rank model even under an arbitrary non-uniform distribution.

In Section 4 we suggest a correction to the trace-norm regularizer,
which we call the weighted trace-norm, that takes into account the
sampling distribution. This correction is motivated by our analysis,
and we discuss how it corrects the problems that the unweighted
trace-norm has with non-uniform sampling. We then show how the
weighted trace-norm indeed yields a significant improvement on the
(highly non-uniformly sampled) Netflix dataset.

2. Complexity Control in terms of 
Matrix Factorizations 

Consider the problem of predicting the entries of some unknown
target matrix $Y \in \mathbb{R}^{n \times m}$ based on a random
subset S of observed entries $Y_S$. For example, n and m may
represent the number of users and the number of movies, and Y may
represent a matrix of partially observed rating values. Predicting
elements of Y can be done by finding a matrix X minimizing the
training error, here measured as a squared error, and some measure
c(X) of complexity. That is, minimizing either:



$$\min_X \; \|X_S - Y_S\|_F^2 + \lambda\, c(X) \tag{1}$$

or:

$$\min_{c(X) \le C} \; \|X_S - Y_S\|_F^2, \tag{2}$$



where $Y_S$, and similarly $X_S$, denotes the matrix "masked" by S:

$$(Y_S)_{i,j} = \begin{cases} Y_{i,j} & \text{if } (i,j) \in S \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$



For now we ignore possible repeated entries in S. We will also
assume that $n \le m$ without loss of generality.

The two formulations (1) and (2) are equivalent up to some (unknown)
correspondence between $\lambda$ and C, and we will be referring to
them interchangeably at our convenience.

2.1. Low Rank Factorization 

A basic measure of complexity is the rank of X, corresponding to the
minimal dimensionality k such that $X = U^\top V$ for some
$U \in \mathbb{R}^{k \times n}$ and $V \in \mathbb{R}^{k \times m}$.
Directly constraining the rank of X forms one of the most popular
approaches to collaborative filtering. Training such a model amounts
to finding the best rank-k approximation to the observed target
matrix Y under the given loss function. However, the rank is
non-convex and hard to minimize. It is also not clear if a strict
dimensionality constraint is most appropriate for measuring the
complexity.

2.2. Trace-norm Regularization

Lately, methods regularizing the norm of the factorization
$U^\top V$, rather than its dimensionality, have been advocated and
were shown to enjoy considerable empirical success (Rennie & Srebro;
Salakhutdinov & Mnih, 2008). This is captured by measuring
complexity in terms of the trace-norm of X, which can be defined
equivalently either as the sum of the singular values of X, or as
(Fazel et al., 2001):

$$\|X\|_{tr} = \min_{X = U^\top V} \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \tag{4}$$

Note that the dimensionality of U and V in (4) is not constrained.
Beyond the modeling appeal of norm-based, rather than
dimension-based, regularization, the trace-norm is a convex function
of X and so can be minimized by either local search or more
sophisticated convex optimization techniques.

2.3. Scaling of the Trace-norm 

It will be useful for us to consider the scaling of the trace-norm
with the size of the matrix X. This will allow us, for example, to
understand the magnitude of the bound C we can expect to put on the
trace-norm in the formulation (2).

The rank, as a measure of complexity, does not scale 
with the size of the matrix. That is, even very large 
matrices can have low rank. Viewing the rank as a 
complexity measure corresponding to the number of 
underlying factors, if data is explained by e.g. two fac- 
tors, then no matter how many rows ("users") and 
columns ("movies") we consider, the data will still 
have rank two. 

The trace-norm, however, does inherently scale with the size of the
matrix. To see this, note that the trace-norm is the $\ell_1$ norm
of the spectrum, while the Frobenius norm is the $\ell_2$ norm of
the spectrum, yielding:

$$\|X\|_F \le \|X\|_{tr} \le \|X\|_F \sqrt{\mathrm{rank}(X)} \le \sqrt{n}\, \|X\|_F, \tag{5}$$

where in the second inequality we used the fact that the number of
non-zero singular values is equal to the rank. The Frobenius norm
certainly increases with the size of the matrix, since the magnitude
of each element does not decrease when we have more elements, and so
the trace-norm will also increase. The above suggests measuring the
trace-norm relative to the Frobenius norm. Without loss of
generality, consider each target entry to be of roughly unit
magnitude¹, e.g. ±1, and so in order to fit Y each entry of X must
also be of roughly unit magnitude. This suggests scaling the
trace-norm by $\sqrt{nm}$. More specifically, we study the
trace-norm through the complexity measure:

$$tc(X) = \frac{\|X\|_{tr}^2}{nm} \tag{6}$$



which puts the trace-norm on a comparable scale to the rank. In
particular, when each entry of X is, on average, of unit magnitude
(i.e. has unit variance), in which case $\|X\|_F = \sqrt{nm}$, we
have:

$$1 \le tc(X) \le \mathrm{rank}(X) \le n. \tag{7}$$



To further understand the trace-norm complexity control, consider
"orthogonal" low-rank matrices $U \in \mathbb{R}^{k \times n}$ and
$V \in \mathbb{R}^{k \times m}$, such that $Y = U^\top V$ and where
the entries of U and V are i.i.d. $\mathcal{N}(0, 1/\sqrt{k})$.²
The matrix Y is then of rank k, with each entry having zero mean and
unit variance (magnitude). Its Frobenius norm is tightly
concentrated at $\|Y\|_F = \sqrt{nm}$.



1 Any other constant magnitude will only result in some
constant scaling.

2 The important issue here is the orthogonality and the 
norm uniformity, not the randomness. But we find it easier 
to think of the orthogonality in terms of an i.i.d. random 
model. 
Since rows of U and V are orthogonal, this is essentially the
singular value decomposition, with all k singular values being equal
to $\sqrt{nm/k}$. We thus have $tc(Y) = k$. And so at least in the
orthogonal case, tc(X) = rank(X).
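
The claim that tc matches the rank for "orthogonal" low-rank matrices can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original analysis; the sizes, the seed, and the helper names `tc` and `random_orthogonal_low_rank` are illustrative assumptions. For finite n and m the equality holds only approximately, since the rows of U and V are only nearly orthogonal.

```python
import numpy as np

def tc(X):
    """Normalized trace-norm complexity tc(X) = ||X||_tr^2 / (n m), as in (6)."""
    n, m = X.shape
    return np.linalg.norm(X, ord="nuc") ** 2 / (n * m)

def random_orthogonal_low_rank(n, m, k, rng):
    """Return Y = U^T V where U, V have i.i.d. N(0, 1/sqrt(k)) entries."""
    std = k ** -0.25  # standard deviation; the variance is 1/sqrt(k)
    U = rng.normal(0.0, std, size=(k, n))
    V = rng.normal(0.0, std, size=(k, m))
    return U.T @ V

rng = np.random.default_rng(0)   # arbitrary seed
n, m, k = 400, 500, 5            # illustrative sizes, not taken from the paper
Y = random_orthogonal_low_rank(n, m, k, rng)

tc_Y = tc(Y)                     # expected to be close to k
mean_sq = np.mean(Y ** 2)        # expected to be close to 1 (unit variance)
print(f"k = {k}, tc(Y) = {tc_Y:.3f}, mean squared entry = {mean_sq:.3f}")
```

The nuclear norm is computed with `np.linalg.norm(..., ord="nuc")`, which sums the singular values, so `tc` is a direct transcription of (6).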

Another place where we can see that tc(X) plays a similar role to
rank(X) is in the generalization and sample complexity guarantees
that can be obtained for low-rank and low-trace-norm learning. Such
learning guarantees were mostly discussed in the context of
Lipschitz continuous loss functions (i.e. functions with a bounded
first derivative), rather than the squared loss. The squared loss
has a bounded second derivative rather than a bounded first
derivative and so requires somewhat different technical tools.
Nevertheless, the main thrust of the results is still valid.

For Lipschitz continuous loss functions, if there is a low-rank
matrix $X^*$ achieving low average error relative to Y (e.g. if
$Y = X^* + \text{noise}$), then by minimizing the training error
subject to a rank constraint (a computationally intractable task),
$|S| = O(\mathrm{rank}(X^*)(n + m))$ samples are enough in order to
guarantee learning a matrix X whose overall average error is close
to that of $X^*$ (Srebro et al., 2005a). Similarly, if there is a
low-trace-norm matrix $X^*$ achieving low average error, then by
minimizing the training error and the trace-norm (a convex
optimization problem), $|S| = O(tc(X^*)(n + m))$ samples are enough
in order to guarantee learning a matrix X whose overall average
error is close to that of $X^*$ (Srebro & Shraibman, 2005). In these
bounds tc(X) plays precisely the same role as the rank, up to
logarithmic factors.

Without getting into the technical tools required to rigorously
establish the above sample complexity guarantees, it is useful to
understand them at a more abstract level. In order to understand the
guarantees for low-rank learning, it is enough to consider the
number of parameters in the rank-k factorization $X = U^\top V$. It
is easy to see that the number of parameters in the factorization is
roughly k(m + n) (perhaps a bit less due to rotational invariance).
And so we would expect to be able to learn X when we have roughly
this many samples, as is indeed confirmed by the rigorous sample
complexity bounds.

For low-trace-norm learning, consider a sample S of size
$|S| \le Cn$, for some constant C. Taking entries of Y to be of unit
magnitude, we have $\|Y_S\|_F = \sqrt{|S|} \le \sqrt{Cn}$ (recall
that $Y_S$ is defined to be zero outside S). From (5) we therefore
have $\|Y_S\|_{tr} \le \sqrt{Cn} \cdot \sqrt{n} = \sqrt{C}\, n$, and
so $tc(Y_S) \le C$. That is, we can "shatter" any sample of size
$|S| \le Cn$ with $tc(X) = C$: no matter what the underlying matrix
Y is, we can always perfectly fit the training data with a low
trace-norm matrix X s.t. $tc(X) \le C$, without generalizing at all
outside S. On the other hand, we must allow matrices with
$tc(X) = tc(X^*)$, otherwise we can't hope to find $X^*$, and so we
can only constrain $tc(X) \le C = tc(X^*)$. We therefore cannot
expect to learn with fewer than $n \cdot tc(X^*)$ samples. It turns
out that this is essentially the largest random sample that can be
shattered with $tc(X) \le C = tc(X^*)$, and that if we have more
than this many samples we can start learning. For our purposes here,
we will mostly just make use of non-learnability arguments of this
form: if we can shatter a random sample of size $|S|$ with a matrix
X having the same complexity (e.g. trace-norm) as our target matrix
$X^*$, we cannot hope to learn without a larger sample.
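
The shattering argument can be illustrated with a small numerical check. This is a sketch under assumed, illustrative values of n, m and C (none of them from the paper): it fills |S| = Cn uniformly chosen entries with arbitrary ±1 labels and confirms that the masked matrix, which fits those labels exactly, has tc at most C.

```python
import numpy as np

def tc(X):
    """Normalized trace-norm complexity tc(X) = ||X||_tr^2 / (n m), as in (6)."""
    n, m = X.shape
    return np.linalg.norm(X, ord="nuc") ** 2 / (n * m)

rng = np.random.default_rng(1)   # arbitrary seed
n = m = 300                      # illustrative square size
C = 3                            # sample budget |S| = C * n
num_samples = C * n

# Draw |S| distinct entries uniformly at random and give them arbitrary +-1 labels.
flat_idx = rng.choice(n * m, size=num_samples, replace=False)
Y_S = np.zeros(n * m)
Y_S[flat_idx] = rng.choice([-1.0, 1.0], size=num_samples)
Y_S = Y_S.reshape(n, m)

frob = np.linalg.norm(Y_S, ord="fro")   # equals sqrt(|S|) for unit-magnitude entries
tc_YS = tc(Y_S)                         # bounded by C via inequality (5)
print(f"|S| = {num_samples}, ||Y_S||_F^2 = {frob ** 2:.0f}, tc(Y_S) = {tc_YS:.3f}, C = {C}")
```

Because the labels are arbitrary, the same bound holds for any target Y with unit-magnitude entries, which is exactly why a constraint tc(X) ≤ C cannot rule out memorization at this sample size.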

3. Trace-Norm Under a Non-Uniform 
Distribution 

In this section, we will analyze trace-norm regularized learning
when the sampling distribution is not uniform. That is, when there
is some, known or unknown, non-uniform distribution $\mathcal{D}$
over entries of the matrix Y (i.e. over index pairs (i, j)) and our
sample S is sampled i.i.d. from $\mathcal{D}$. Of course, if
$\mathcal{D}$ concentrates on only a small subset of the matrix, we
have no hope of recovering rows and columns of Y on which we have
zero probability of seeing an observation. Instead, our objective
here, as is typically the case in learning under an arbitrary
distribution, is to get low average error with respect to the same
distribution $\mathcal{D}$. That is, we measure generalization
performance in terms of the weighted sum-squared-error:

$$\|X - Y\|_{\mathcal{D}}^2 = \mathbb{E}_{(i,j) \sim \mathcal{D}}\left[(X_{ij} - Y_{ij})^2\right] \tag{8}$$



We first point out that when using the rank for complexity control,
i.e. when minimizing the training error subject to a low-rank
constraint, non-uniformity does not pose a problem. The same
generalization and learning guarantees that can be obtained in the
uniform case also hold under an arbitrary distribution
$\mathcal{D}$. In particular, if there is some low-rank $X^*$ such
that $\|X^* - Y\|_{\mathcal{D}}$ is small, then
$O(\mathrm{rank}(X^*)(n + m))$ samples are enough in order to learn
(by minimizing training error subject to a rank constraint) a matrix
X with $\|X - Y\|_{\mathcal{D}}$ almost as small as
$\|X^* - Y\|_{\mathcal{D}}$ (Srebro et al., 2005a).³



However, the same does not hold when learning using the trace-norm.
To see this, consider an orthogonal rank-k square $n \times n$
matrix, and a sampling distribution which is uniform over an
$n_A \times n_A$ submatrix A, with $n_A = n^a$ (see Fig. 1). That
is, the row (e.g. "user") is selected uniformly among the first
$n_A$ rows, and the column (e.g. "movie") is selected uniformly
among the first $n_A$ columns. We will use A to denote the subset of
entries in the submatrix, i.e.
$A = \{(i,j) \mid 1 \le i,j \le n_A\}$, rather than the matrix
itself, and so we can say that $\mathcal{D}$ is uniform on A. For
any sample S, we have:

$$tc(Y_S) = \frac{\|Y_S\|_{tr}^2}{n^2} \le \frac{\|Y_S\|_F^2\, \mathrm{rank}(Y_S)}{n^2} \le \frac{|S|\, n^a}{n^2} = \frac{|S|}{n^{2-a}} \tag{9}$$

where we again take the entries in Y to be of unit magnitude. In the
second inequality above we use the fact that $Y_S$ is zero outside
of A, and so we can bound the rank of $Y_S$ by the dimensionality
$n_A = n^a$ of A.

Setting $a < 1$, we see that we can shatter any sample of size⁴
$k n^{2-a} = \omega(n)$ with a matrix X for which $tc(X) \le k$.
When $a < 1/2$, the total number of entries in A is less than n, and
so $O(n)$ observations are enough in order to memorize $Y_A$. But
when $1/2 < a < 1$, with $O(n)$ observations, restricting to even
$tc(X) \le 1$, we can neither learn Y, since we can shatter $Y_S$,
nor memorize it. For example, when $a = 2/3$ and so
$n_A = n^{2/3}$, we need roughly $n^{4/3}$ samples to start learning
by constraining tc(X) to a constant, the same as we would need in
order to memorize $Y_A$. This is a factor of $n^{1/3}$ greater than
the sample size needed to learn a matrix with constant tc(X) in the
uniform case.

The above arguments establish that restricting the complexity to
$tc(X) \le k$ might not lead to generalization with $O(kn)$ samples
in the non-uniform case. But does this mean that we cannot learn a
rank-k matrix by minimizing the trace-norm using $O(kn)$ samples
when the sampling distribution is concentrated on a small submatrix?
Of course this is not the case. Since the samples are uniform on a
small submatrix, we can just think of the submatrix A as our entire
space. The target matrix still has low rank, even when restricted to
A, and we are back in the uniform sampling scenario. The only issue
here is that $tc(X) \le k$, i.e. $\|X\|_{tr} \le n\sqrt{k}$, is the
right constraint in the uniform observation scenario. When samples
are concentrated in A, we actually need to restrict to a much
smaller trace norm, $\|X\|_{tr} \le n^a \sqrt{k}$, which will allow
learning with $O(k n^a)$ samples.

Figure 1. The two submatrices A of size $n_A = n^a$ and B of size
$n_B = n/2$.

3 Actually, this is shown only for Lipschitz continuous loss
functions, and not for the squared loss, but at the very least this
holds if X is appropriately clipped. Since formal guarantees are not
the focus of this paper, we rather view this statement only as an
indicative statement without stating it rigorously.

4 Recall that $f(n) = \omega(g(n))$ is the same as
$g(n) = o(f(n))$ and means that for all p we have
$g(n) \log^p(g(n)) / f(n) \to 0$.
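
The bound (9) for a sample concentrated on the submatrix A can also be checked numerically. The sketch below assumes illustrative values (n = 512, so that a = 2/3 gives the integer n_A = 64, and |S| = 2000); these are not the values used in the paper's experiments. It places arbitrary ±1 labels inside A only and compares tc(Y_S), measured relative to the full n × n matrix, with |S| / n^(2-a).

```python
import numpy as np

n = 512                 # illustrative size, chosen so that n^(2/3) is exactly 64
n_A = 64                # n_A = n^a with a = 2/3
num_samples = 2000      # |S|, all drawn inside the n_A x n_A block A
rng = np.random.default_rng(2)   # arbitrary seed

# Sample distinct entries uniformly inside A and label them with arbitrary +-1 values.
flat_idx = rng.choice(n_A * n_A, size=num_samples, replace=False)
rows, cols = np.unravel_index(flat_idx, (n_A, n_A))
Y_S = np.zeros((n, n))
Y_S[rows, cols] = rng.choice([-1.0, 1.0], size=num_samples)

# tc is measured relative to the full matrix, as in (9).
tc_YS = np.linalg.norm(Y_S, ord="nuc") ** 2 / n ** 2
bound = num_samples * n_A / n ** 2   # |S| n^a / n^2 = |S| / n^(2 - a)
print(f"tc(Y_S) = {tc_YS:.4f}  <=  |S| / n^(2-a) = {bound:.4f}")
```

With these numbers the bound is below 1, so even the strict constraint tc(X) ≤ 1 leaves room to memorize the sample, in line with the discussion of the case 1/2 < a < 1.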

It is, however, easy to modify the above example and construct a
sampling distribution under which $\Omega(n^{4/3})$ samples are
required in order to learn even an "orthogonal" low-rank matrix, no
matter what constraint is placed on the trace-norm. This is a
significantly larger sample complexity than $O(kn)$, which is what
we would expect, and what is required for learning by constraining
the rank directly.

To do so, consider another submatrix B of size $n_B \times n_B$
with $n_B = n/2$, such that the rows and columns of A and of B do
not overlap (Fig. 1). Now, consider a sampling distribution
$\mathcal{D}$ which is uniform over A with probability half, and
uniform over B with probability half. Consider fitting a noisy
matrix $Y = X^* + \text{noise}$ where $X^*$ is "orthogonal" rank-k.
In order to fit on B, we need to allow a trace-norm of at least
$\|X^*_B\|_{tr} = \frac{n}{2}\sqrt{k}$, i.e. allow $tc(X) = k/4$.
But as discussed above, with such a generous constraint on the
trace-norm, we will be able to shatter $S \cap A$ whenever
$|S \cap A| = |S|/2 \le \frac{k}{4} n^{2-a}$. Since there is no
overlap in rows and columns, and so values in the sub-matrices A and
B are independent, shattering $S \cap A$ means we cannot hope to
learn in A. Setting $a = 2/3$ as before, it seems that with
$o(n^{4/3})$ samples, we cannot learn in both A and B: either we
constrain to a trace-norm which is too low to fit $X^*_B$ (we
under-fit on B), or we allow a trace-norm which is high enough to
overfit $Y_{S \cap A}$. Either way, we will make errors on at least
half the mass of $\mathcal{D}$.⁵

Figure 2, left panel, precisely illustrates this phenomenon on a
simulation experiment. For this synthetic example, we used
$n_A = 300$ and $n_B = 4700$, with an orthogonal rank-2 matrix
$X^*$ and $Y = X^* + \mathcal{N}(0, 1)$ (in case of repeated
entries, the noise is independent for each appearance in the
sample). The training sample size was set to |S| = 140,000.

5 To make the above argument more precise, we should note that if we
do allow high enough trace-norm to fit B, and $|S| = o(n^{4/3})$,
then the "cost" of overfitting $Y_{S \cap A}$ is negligible compared
to the cost of fitting $X_B$. For large enough n, we would be
tempted to very slightly deteriorate the fit of $X_B$ in order to
"free up" enough trace-norm and completely overfit $Y_{S \cap A}$.

Figure 2. Mean squared error (MSE) of the learned model as a
function of the constraint on tc(X) (left) and $tc_{p,q}(X)$
(right). The black (middle) curve is the overall MSE, the red
(bottom) curve measures only the contribution from A, and the blue
(top) curve measures only the contribution from B.

The three curves of Fig. 2 measure the excess (test) error
$\|X - X^*\|_{\mathcal{D}}^2 = \|X - Y\|_{\mathcal{D}}^2 - \|Y - X^*\|_{\mathcal{D}}^2$
of the learned model, as well as the error contribution from A and
from B, as a function of the constraint on tc(X), for the sampling
distribution discussed above and a specific sample size. As can be
seen, although it is possible to constrain tc(X) so as to achieve
squared error of less than 0.8 on B, this constraint is too lax for
A and allows for over-fitting. Constraining tc(X) so as to avoid
overfitting A (achieving almost zero excess test error) leads to a
suboptimal fit on B.

Until now we discussed learning by constraining the trace-norm,
i.e. using the formulation (2). It is also insightful to consider
the penalty view (1), i.e. learning by minimizing

$$\min_X \; \|Y_S - X_S\|_F^2 + \lambda \|X\|_{tr}. \tag{10}$$

First observe that the characterization (4) allows us to decompose
$\|X\|_{tr} = \|X_A\|_{tr} + \|X_B\|_{tr}$, where w.l.o.g. we take
all columns of U and V outside A and B to be zero. Since we also
have
$\|Y_S - X_S\|_F^2 = \|Y_{A \cap S} - X_{A \cap S}\|_F^2 + \|Y_{B \cap S} - X_{B \cap S}\|_F^2$,
we can decompose the training objective (10) as:

$$\begin{aligned}
\|Y_S - X_S\|_F^2 + \lambda \|X\|_{tr}
&= \left(\|Y_{A \cap S} - X_{A \cap S}\|_F^2 + \lambda \|X_A\|_{tr}\right) + \left(\|Y_{B \cap S} - X_{B \cap S}\|_F^2 + \lambda \|X_B\|_{tr}\right) \\
&= \left(\|Y_{A \cap S} - X_{A \cap S}\|_F^2 + \lambda n_A \sqrt{tc_A(X_A)}\right) + \left(\|Y_{B \cap S} - X_{B \cap S}\|_F^2 + \lambda n_B \sqrt{tc_B(X_B)}\right),
\end{aligned} \tag{11}$$

where $tc_A(X_A) = \|X_A\|_{tr}^2 / n_A^2$ (and similarly
$tc_B(X_B)$) refers to the complexity measure tc(·) measured
relative to the size of A (similarly B). We see that the training
objective decomposes to a trace-norm regularized problem in A and a
trace-norm regularized problem in B. Each one of these problems is a
trace-norm regularized learning problem, under a uniform sampling
distribution (in the corresponding submatrix) of a noisy low-rank
"orthogonal" matrix, and can therefore be learned with $O(k n_A)$
and $O(k n_B)$ samples respectively. In other words, $O(kn)$ samples
should be enough to learn both inside A and inside B.
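
The additivity of the trace-norm over the two non-overlapping blocks, and the resulting block-size scaling of the regularizer, can be verified on a small example. The sketch below assumes a matrix X that is exactly zero outside A and B and uses arbitrary illustrative block sizes; it is a numerical illustration, not a proof.

```python
import numpy as np

def trace_norm(X):
    """Sum of singular values of X."""
    return np.linalg.norm(X, ord="nuc")

rng = np.random.default_rng(3)   # arbitrary seed
n_A, n_B = 6, 20                 # illustrative block sizes
X_A = rng.normal(size=(n_A, n_A))
X_B = rng.normal(size=(n_B, n_B))

# Assemble X with A and B on disjoint rows and columns, zero elsewhere.
n = n_A + n_B
X = np.zeros((n, n))
X[:n_A, :n_A] = X_A
X[n_A:, n_A:] = X_B

whole = trace_norm(X)
blocks = trace_norm(X_A) + trace_norm(X_B)

# Rewrite each block term as (block size) * sqrt(tc of the block).
tc_A = trace_norm(X_A) ** 2 / n_A ** 2
tc_B = trace_norm(X_B) ** 2 / n_B ** 2
rescaled = n_A * np.sqrt(tc_A) + n_B * np.sqrt(tc_B)
print(f"||X||_tr = {whole:.4f}, blocks = {blocks:.4f}, rescaled = {rescaled:.4f}")
```

The last line makes the issue visible: for comparable values of tc_A and tc_B, the penalty on B is multiplied by n_B while the penalty on A is multiplied by the much smaller n_A.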

However, the regularization tradeoff parameter $\lambda$ compounds
the two problems. When the objective is expressed in terms of
tc(·), as in (11), the regularization tradeoff is scaled differently
in each part of the training objective. With $O(kn)$ samples, it is
possible to learn in A with some setting of $\lambda$, and it is
possible to learn in B with some other setting of $\lambda$, but
from the discussion above we learn that no single value of
$\lambda$ will allow learning in both A and B. Either $\lambda$ is
too high, yielding too strict regularization in B, so learning on B
is not possible, perhaps since it is scaled by $n_B \gg n_A$. Or
$\lambda$ is too small and does not provide enough regularization
in A.

Returning to our simulation experiment, the solid curves of Fig. 3
show the excess test error for the minimizer of the training
objective (10), as a function of the regularization tradeoff
parameter $\lambda$. Note that these are essentially the same curves
as displayed in Fig. 2, except the path of regularized solutions is
now parameterized by $\lambda$ rather than by the bound on tc(X).
Not surprisingly we see the same phenomena: different values of
$\lambda$ are required for optimal learning on A and on B. Forcing
the same $\lambda$ on both parts of the training objective yields a
deterioration in the generalization performance.
Figure 3. The solid curves show the optimum of the mean squared
error objective (10) (unweighted trace-norm), as a function of the
regularization parameter $\lambda$. The dashed curves display a
weighted trace-norm.



4. Weighted Trace Norm 

The decomposition (11) and the discussion in the previous section
suggest weighting the trace-norm by the frequency of rows and
columns. For a sampling distribution $\mathcal{D}$, denote by p(i)
the row marginal, i.e. the probability of observing row i, and
similarly denote by q(j) the column marginal. We propose using the
weighted version of the trace-norm as a regularizer:

$$\|X\|_{tr(p,q)} = \left\|\mathrm{diag}(\sqrt{p})\, X\, \mathrm{diag}(\sqrt{q})\right\|_{tr}, \tag{12}$$

where $\mathrm{diag}(\sqrt{p})$ is a diagonal matrix with
$\sqrt{p(i)}$ on its diagonal (similarly $\mathrm{diag}(\sqrt{q})$).
The corresponding normalized complexity measure is given by
$tc_{p,q}(X) = \|X\|_{tr(p,q)}^2$. Note that for a uniform
distribution we have that $tc_{p,q}(X) = tc(X)$. Furthermore, it is
easy to verify that for an "orthogonal" rank-k matrix X we have
$tc_{p,q}(X) = k$ for any sampling distribution.
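
A direct transcription of the weighted trace-norm (12) is short. The sketch below is illustrative: the function names `weighted_trace_norm`, `tc_weighted` and `tc`, the matrix sizes, and the randomly generated marginals are assumptions made for the example. It checks that under uniform marginals p(i) = 1/n, q(j) = 1/m the weighted measure coincides with tc(X) from (6).

```python
import numpy as np

def weighted_trace_norm(X, p, q):
    """||diag(sqrt(p)) X diag(sqrt(q))||_tr, as in (12)."""
    scaled = np.sqrt(p)[:, None] * X * np.sqrt(q)[None, :]
    return np.linalg.norm(scaled, ord="nuc")

def tc_weighted(X, p, q):
    """Normalized complexity tc_{p,q}(X) = ||X||_{tr(p,q)}^2."""
    return weighted_trace_norm(X, p, q) ** 2

def tc(X):
    """Unweighted tc(X) = ||X||_tr^2 / (n m), as in (6)."""
    n, m = X.shape
    return np.linalg.norm(X, ord="nuc") ** 2 / (n * m)

rng = np.random.default_rng(4)   # arbitrary seed
n, m = 30, 40                    # illustrative sizes
X = rng.normal(size=(n, m))

# Uniform marginals recover the unweighted measure.
p_uniform = np.full(n, 1.0 / n)
q_uniform = np.full(m, 1.0 / m)
uniform_value = tc_weighted(X, p_uniform, q_uniform)

# Hypothetical skewed marginals: a few rows and columns carry most of the mass.
p_skew = rng.dirichlet(np.ones(n))
q_skew = rng.dirichlet(np.ones(m))
skew_value = tc_weighted(X, p_skew, q_skew)
weighted_sq_error_scale = np.sum(p_skew[:, None] * q_skew[None, :] * X ** 2)
print(f"uniform: {uniform_value:.4f} vs tc(X) = {tc(X):.4f}; skewed: {skew_value:.4f}")
```

Scaling rows by sqrt(p(i)) and columns by sqrt(q(j)) is done by broadcasting rather than by forming the diagonal matrices explicitly, which avoids two dense matrix products.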

Equipped with the weighted trace-norm as a regularizer, let us
revisit the problematic sampling distribution studied in the
previous section. In order to fit the "orthogonal" rank-k $X^*$, we
need a weighted trace-norm of
$\|X^*\|_{tr(p,q)} = \sqrt{tc_{p,q}(X^*)} = \sqrt{k}$. How large a
sample $S \cap A$ can we now shatter using such a weighted
trace-norm? We can shatter a sample if
$\|Y_{S \cap A}\|_{tr(p,q)} \le \sqrt{k}$. In order to calculate
$\|Y_{S \cap A}\|_{tr(p,q)}$, recall that for $(i,j) \in A$ we have
$p(i) = q(j) = 1/(2 n_A)$. We can now calculate:

$$\|Y_{S \cap A}\|_{tr(p,q)} = \left\|\sqrt{1/(2 n_A)}\; Y_{S \cap A}\, \sqrt{1/(2 n_A)}\right\|_{tr} = \frac{\|Y_{S \cap A}\|_{tr}}{2 n_A} \le \frac{\sqrt{|S \cap A|\, n_A}}{2 n_A} = \sqrt{\frac{|S|}{8 n_A}}.$$

That is, we can shatter a sample of size up to
$|S| = 8 k n_A < 8kn$. The calculation for B is identical. It seems
that now, with a fixed constraint on the weighted trace-norm, we
have enough capacity to both fit $X^*$, and with $O(kn)$ samples,
avoid overfitting on A.

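Returning to the penalization view (10), we can again decompose the
training objective:

$$\min_X \; \|Y_S - X_S\|_F^2 + \lambda \|X\|_{tr(p,q)} \tag{13}$$

$$\begin{aligned}
\|Y_S - X_S\|_F^2 + \lambda \|X\|_{tr(p,q)}
&= \left(\|Y_{A \cap S} - X_{A \cap S}\|_F^2 + \lambda \|X_A\|_{tr(p,q)}\right) + \left(\|Y_{B \cap S} - X_{B \cap S}\|_F^2 + \lambda \|X_B\|_{tr(p,q)}\right) \\
&= \left(\|Y_{A \cap S} - X_{A \cap S}\|_F^2 + \frac{\lambda}{2} \sqrt{tc_A(X_A)}\right) + \left(\|Y_{B \cap S} - X_{B \cap S}\|_F^2 + \frac{\lambda}{2} \sqrt{tc_B(X_B)}\right),
\end{aligned} \tag{14}$$

avoiding the scaling by the block sizes which we encountered in
(11).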

Returning to the synthetic experiments of Fig. 3 and comparing (11)
with (14), we see that introducing the weighting corresponds to a
relative change of $n_A / n_B$ in the correspondence of the
regularization tradeoff parameters used for A and for B. This
corresponds to a shift of $\log(n_A / n_B)$ in the log-domain used
in the figure. Shifting the solid red (bottom) curve by this amount
yields the dashed red (bottom) curve. The solid blue (top) curve and
the dashed red (bottom) curve thus represent the excess error on B
and on A when the weighted trace norm is used, i.e. the training
objective (14) is minimized (except for an overall scaling in
$\lambda$). The dashed black (middle) curve is the overall excess
error when using this training objective. As can be seen, the
weighting aligns the excess errors on A and on B much better, and
yields a lower overall error. The weighted trace-norm achieves the
lowest MSE of 0.4301 with corresponding $\lambda = 0.11$. This is
compared to the lowest MSE of 0.4981 with $\lambda = 0.80$, achieved
by the unweighted trace-norm. It is also interesting to observe that
the weighted trace-norm outperforms its unweighted counterpart for a
wide range of regularization parameters
$\lambda \in [0.01, 0.6]$. This may also suggest that in practice,
particularly when working with large and imbalanced datasets, it may
be easier to search for regularization parameters using the weighted
trace-norm. Fig. 2, right panel, further shows the test error as a
function of the constraint on $tc_{p,q}(X)$.

Finally, Fig. 3 also suggests that the optimal shift is actually
smaller than $n_A / n_B$. We consider a smaller shift by using the
partially-weighted trace-norm:

$$\|X\|_{tr(p,q,\alpha)} = \left\|\mathrm{diag}(p^{\alpha/2})\, X\, \mathrm{diag}(q^{\alpha/2})\right\|_{tr} = \min_{X = U^\top V} \frac{1}{2}\left(\sum_i p(i)^\alpha \|U_i\|^2 + \sum_j q(j)^\alpha \|V_j\|^2\right) \tag{15}$$

And the corresponding normalized complexity measure
$tc_{p,q,\alpha}(X) = \|X\|_{tr(p,q,\alpha)}^2$.

5. Practical Implementation 

When dealing with large datasets, such as the Netflix data, the
most practical way to fit trace-norm regularized models is through
stochastic gradient descent (Salakhutdinov & Mnih, 2008;
Koren, 2008).



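Let $n_i = \sum_j \mathbb{1}[(i,j) \in S]$ and
$m_j = \sum_i \mathbb{1}[(i,j) \in S]$ denote the number of observed
ratings for user i and movie j respectively. The training objective
(over the index pairs $(i,j) \in S$) using the partially-weighted
trace-norm (Eq. 15) can be written as:

$$\sum_{(i,j) \in S} \left[ \left(Y_{ij} - U_i^\top V_j\right)^2 + \frac{\lambda}{2} \left( \frac{p(i)^\alpha}{n_i} \|U_i\|^2 + \frac{q(j)^\alpha}{m_j} \|V_j\|^2 \right) \right] \tag{16}$$

where $U \in \mathbb{R}^{k \times n}$ and
$V \in \mathbb{R}^{k \times m}$. We can optimize this objective
using stochastic gradient descent by picking one training pair
(i, j) at random at each iteration, and taking a step in the
direction opposite the gradient of the term corresponding to the
chosen (i, j).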



Note that even though the objective (16) as a function of U and V
is non-convex, there are no non-global local minima if we set k to
be large enough, i.e. $k > \min(n, m)$ (Burer & Monteiro, 2005).
However, fitting orthogonal models in practice with very large
values of k becomes computationally expensive. Instead, we consider
truncated trace-norm minimization by restricting k to smaller
values. In the next section we demonstrate that even when using the
truncated trace-norm, its weighted version significantly improves
the model's prediction performance.

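In all of our experiments, we also replace the unknown row p(i) and
column q(j) marginals in (16) by their empirical estimates
$\hat{p}(i) = n_i / |S|$ and $\hat{q}(j) = m_j / |S|$. This results
in the following objective:

$$\sum_{(i,j) \in S} \left[ \left(Y_{ij} - U_i^\top V_j\right)^2 + \frac{\lambda}{2 |S|^\alpha} \left( n_i^{\alpha - 1} \|U_i\|^2 + m_j^{\alpha - 1} \|V_j\|^2 \right) \right] \tag{17}$$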
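
The stochastic gradient procedure for objective (17) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the step size `lr`, the initialization scale, the synthetic data, and the skewed sampling marginals are all hypothetical choices made for the example.

```python
import numpy as np

def objective(rows, cols, vals, U, V, n_cnt, m_cnt, lam, alpha):
    """Value of objective (17) over the observed triples (rows, cols, vals)."""
    S = len(vals)
    pred = np.sum(U[:, rows] * V[:, cols], axis=0)
    reg = (n_cnt[rows] ** (alpha - 1.0) * np.sum(U[:, rows] ** 2, axis=0)
           + m_cnt[cols] ** (alpha - 1.0) * np.sum(V[:, cols] ** 2, axis=0))
    return np.sum((vals - pred) ** 2) + lam / (2.0 * S ** alpha) * np.sum(reg)

def sgd_epoch(rows, cols, vals, U, V, n_cnt, m_cnt, lam, alpha, lr, rng):
    """One pass over S; each step follows the gradient of a single summand of (17)."""
    S = len(vals)
    scale = lam / S ** alpha
    for t in rng.permutation(S):
        i, j, y = rows[t], cols[t], vals[t]
        u, v = U[:, i].copy(), V[:, j].copy()
        err = y - u @ v
        # For alpha = 1 the count factors below equal 1, so counts drop out.
        U[:, i] -= lr * (-2.0 * err * v + scale * n_cnt[i] ** (alpha - 1.0) * u)
        V[:, j] -= lr * (-2.0 * err * u + scale * m_cnt[j] ** (alpha - 1.0) * v)

rng = np.random.default_rng(5)          # arbitrary seed
n, m, k, num_obs = 40, 30, 3, 2000      # illustrative sizes
U_true = rng.normal(0.0, k ** -0.25, size=(k, n))
V_true = rng.normal(0.0, k ** -0.25, size=(k, m))

# Hypothetical non-uniform sampling: skewed row and column marginals.
p = rng.dirichlet(np.full(n, 0.5))
q = rng.dirichlet(np.full(m, 0.5))
rows = rng.choice(n, size=num_obs, p=p)
cols = rng.choice(m, size=num_obs, p=q)
vals = np.sum(U_true[:, rows] * V_true[:, cols], axis=0) + 0.1 * rng.normal(size=num_obs)

# Row and column counts n_i and m_j (as floats, so fractional powers are safe).
n_cnt = np.bincount(rows, minlength=n).astype(float)
m_cnt = np.bincount(cols, minlength=m).astype(float)

U = 0.1 * rng.normal(size=(k, n))
V = 0.1 * rng.normal(size=(k, m))
lam, alpha, lr = 0.05, 1.0, 0.02        # assumed hyperparameters for the demo
before = objective(rows, cols, vals, U, V, n_cnt, m_cnt, lam, alpha)
for _ in range(20):
    sgd_epoch(rows, cols, vals, U, V, n_cnt, m_cnt, lam, alpha, lr, rng)
after = objective(rows, cols, vals, U, V, n_cnt, m_cnt, lam, alpha)
print(f"objective before: {before:.1f}, after: {after:.1f}")
```

Changing `alpha` to 0 in this sketch gives the unweighted trace-norm updates, in which frequently observed rows and columns receive a per-step penalty scaled down by their counts.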



Table 1. Model performance using Root Mean Squared Error (RMSE) on
the Netflix qualification set and on the test set, which was
randomly subsampled from the training data.

                      RMSE                      RMSE
  α       k      Test      Qual       k      Test      Qual
  1       30     0.7607    0.9105     100    0.7412    0.9071
  0.9     30     0.7573    0.9091     100    0.7389    0.9062
  0.75    30     0.7723    0.9128     100    0.7491    0.9098
  0.5     30     0.7823    0.9159     100    0.7613    0.9127
  0       30     0.7889    0.9235     100    0.7667    0.9203



Setting $\alpha = 1$, corresponding to the weighted trace-norm (12),
results in stochastic gradient updates that do not involve the row
and column counts at all and are in some sense the simplest.
Strangely, and likely originating as a "bug" in calculating the
stochastic gradients by one of the participants, these are the
actual SGD steps used by many practitioners on the Netflix dataset
(Koren, 2008; Takacs et al., 2009; Salakhutdinov & Mnih, 2008).



6. Experimental results 

We evaluated various models on the Netflix dataset, 
which is the largest publicly available collaborative fil- 
tering dataset. The training set contains 100,480,507 
ratings from 480,189 randomly-chosen, anonymous 
users on 17,770 movie titles. As part of the training
data, Netflix also provides a qualification set, contain-
ing 1,408,395 ratings. The pairs were selected from
the most recent ratings for a subset of the users in the 
training dataset. Due to the special selection scheme, 
ratings from users with few ratings are overrepresented 
in the qualification set, relative to the training set. To 
avoid the issue of dealing with different training and 
test distributions, we also created our own validation 
and test sets, each containing 100,000 ratings that were 
randomly selected from the training set. As a base- 
line, Netflix provided the test score of its own system 
trained on the same data, which is 0.9514. 

This dataset is interesting for several reasons. First, 
it is very large, and very sparse (98.8% sparse). Sec- 
ond, the dataset is very imbalanced with highly non- 
uniform samples. It includes users with over 10,000 
ratings as well as users who rated fewer than 5 movies. 

6.1. Results 

In our first experiment, for various values of $\alpha$, we fit
parameters U and V using stochastic gradient descent as in (17) with
k = 30. Both U and V were randomly initialized for all models, and
regularization parameters $\lambda$ were chosen by cross-validation.

Performance results of the weighted trace-norm regularization for
various values of $\alpha$ are shown in Table 1. Observe that the
weighted trace-norm ($\alpha = 1$) achieved an RMSE of 0.9105 on the
Netflix qualification set, significantly outperforming its
unweighted counterpart with $\alpha = 0$, which achieved an RMSE of
0.9235. This large performance gap is striking. It clearly suggests
that the weighting is quite important. Table 1 further reveals that
the weighted trace-norm ($\alpha = 1$) is not optimal. Surprisingly,
the partially weighted trace-norm with $\alpha = 0.9$ achieved an
RMSE of 0.9091, slightly outperforming the weighted trace-norm
($\alpha = 1$). Performance results on the artificially created test
set are similar to the results on the qualification set. Note also
that the large gap in generalization performance between the test
and the qualification sets is due to Netflix's special qualification
selection scheme.

In our second experiment, we fitted much larger models with
k = 100. As expected, the weighted trace-norm regularization
($\alpha = 1$) attained an RMSE of 0.9071, significantly improving
upon the unweighted model's RMSE of 0.9203. Again, this large
performance gap strongly suggests that the weighting can yield a
significant performance boost, particularly when dealing with very
imbalanced data, such as the Netflix dataset.

In all of our experiments, we also empirically observed that for a
wide range of regularization parameters $\lambda$, optimizing the
weighted trace-norm almost always yielded better predictions on both
the test and the Netflix qualification sets than optimizing the
unweighted trace-norm. This confirms our previous results on the
synthetic experiment and strongly suggests that it may be far easier
to search for regularization parameters using the weighted
trace-norm.

7. Discussion 

In this paper we showed both analytically and empirically that
under non-uniform sampling, trace-norm regularization can lead to
significant performance deterioration and an increase in sample
complexity. Motivated by our analysis, we further suggested a
corrected version of the trace-norm, called the weighted trace-norm,
that does take into account the non-uniform sampling distribution.
Our results on both synthetic and highly imbalanced Netflix datasets
further demonstrate that the weighted trace-norm yields significant
improvements in prediction quality. It is interesting to note that
setting $\alpha = 1$ in the weighted trace-norm objective (17)
implies that the frequent users (movies) get regularized much more
strongly than the rare users (movies). From a Bayesian perspective,
such regularization is quite unusual, since it effectively states
that the effect of the prior becomes stronger as we observe more
data. Yet, our analysis and empirical results strongly suggest that
in the non-uniform setting, such "unorthodox" regularization is
crucial for achieving good generalization performance.

Although theoretical guarantees are not the focus of this work, we
hope that the weighted trace-norm, and the discussions in Sections 3
and 4, will be helpful in deriving theoretical learning guarantees
for non-uniform sampling distributions, both in the form of
generalization error bounds as in (Srebro & Shraibman, 2005), and
generalizing the compressed-sensing inspired work on recovery of
noisy low-rank matrices as in (Candes & Plan, 2009; Recht, 2009).